ABSTRACT

In this paper, we formulate a top-k query that compares objects in
a database to a user-provided query object on a novel scoring func- 
tion. The proposed scoring function combines the idea of attractive 
and repulsive dimensions into a general framework to overcome 
the weakness of traditional distance or similarity measures. We 
study the properties of the proposed class of scoring functions and 
develop efficient and scalable index structures that index the iso- 
lines of the function. We demonstrate various scenarios where the 
query finds application. Empirical evaluation demonstrates a per- 
formance gain of one to two orders of magnitude on querying time 
over existing state-of-the-art top-k techniques. Further, a qualitative
analysis is performed on a real dataset to highlight the potential
of the proposed query in discovering hidden data characteristics. 

1. INTRODUCTION 

Top-k queries play a critical role in various applications such
as business intelligence analysis, e-commerce, and virtual screening
of molecular libraries. Typically, datasets for such tasks contain
a large number of multidimensional objects. A top-k query
on such a dataset returns a subset of objects that maximize a
particular scoring function. A number of techniques exist that aim to
answer top-k queries efficiently [3, 7-10, 12, 18, 20]. However, most
of them assume a monotonic scoring function. A function is called
monotonic if $f(x_1, .., x_n) \le f(x'_1, .., x'_n)$ whenever $x_i \le x'_i \; \forall i$.
For example, $f(x, y) = x + y$ is a monotonic function, whereas
$f(x, y) = x - y$ is not. Often situations are encountered where a
monotonic function is not enough to identify the interesting objects. 
In this paper, we study a class of non-monotonic scoring functions 
that models a mixture of similarity on attractive dimensions and 
distance on repulsive dimensions to overcome the limitations of 
traditional similarity or distance measures. 

Consider the online advertising scenario, where an advertiser 
places its advertisements in publishers' pages such as Yahoo! or 
Google. In such an application, an advertiser is interested in two 
sets of information: the cost of advertising, and the hit rate on their 
advertisements. Typically, top publishers charge more since an
advertisement placed in their page is likely to get more hits. From
an advertiser's perspective, however, a high hit rate is attractive,
whereas a high cost is repulsive. Thus, it is of interest to find those
publishers that receive hit rates similar to that of a top publisher,
and yet are cheaper.

Figure 1: A sample database and query points

Top-k queries of this nature also find application in scaffold
hopping in the field of chemoinformatics. Scaffold hopping attempts
to discover molecules that are structurally diverse from a given
query molecule, but show similar binding activity. In chemical
libraries, molecules are represented as high dimensional points in
the virtual space [1, 5, 13, 14]. Further, the binding activity of a
molecule against various targets is maintained as a feature vector. 
For scaffold hopping, given a query molecule, one needs to formu- 
late a function that aggregates similarity on dimensions represent- 
ing binding activity and distance on dimensions representing the 
structure. 

To formalize the idea further, consider Figure 1, which presents a
sample zoological database where each point represents a species. 
In the study of species evolution, it is often of interest to zoologists 
to find similar species evolving in different regions of the world. 
Such a query can be answered if the scoring function computes sim- 
ilarity on phylogeny and distance on the evolutionary habitat. Thus, 
given a query species, the scoring function should return database 
points that are as distant as possible in habitat from the query and 
as similar as possible in phylogeny. Therefore, for $q_1$, the desired
top-1 answer is $p_1$ since its phylogeny is the same as that of $q_1$,
but the habitat is vastly different. Although both $p_2$ and $p_3$
reside in vastly dissimilar habitats, their phylogeny is distant from
the query as well. Following the same reasoning, for $q_2$, the top-1
result is $p_3$; no other
species is as similar in its phylogeny and as distant in its habitat.

The above function can be modeled in two ways. Assume p is 
the absolute distance in phylogeny between a data point and the 
query species, while h is the absolute distance in habitat between 
them. Thus, the top-k most interesting data points can be computed
using the function

$Score(\text{data point}, \text{query}) = h - p$ (1)

An alternative approach to model the same phenomenon is to
take the ratio $h/p$ between habitat and phylogeny. While the two
functions do not behave in the exact same manner, they model 
the desired features. More importantly, they both share the non- 
monotonic behavior. The obvious question that arises from the 
above analysis is: Can a monotonic function be used that models 
the same phenomenon and leverages existing top-k techniques? 

The non-monotonicity of the ranking function stems from the 
need to incorporate a query object in the ranking procedure. Most 
of the existing top-k querying techniques assume a monotonic ranking
function and typically do not incorporate a user-provided query
object. In our case, the user-provided query object is central to the
problem, and the entire analysis is based on the distances in the 
individual dimensions between the query and the database points. 
Since computing distance is non-monotonic, the ranking functions 
inherit the non-monotonicity property as well. Certainly, alterna- 
tive formulations of the problem aimed at modeling similar prop- 
erties are possible. However, such formulations are bound to be 
non-monotonic as well due to the inherent need to capture the dis- 
tances between the query and the database points. 

$Score(p, q) = (c - |p_y - q_y|) + |p_x - q_x|$ (2)

Consider Eqn. 2 for example. Assume y is the repulsive dimension
and x is the attractive dimension. For query point $q = (q_x, q_y)$
and a database point $p = (p_x, p_y)$, the score can be computed by
aggregating the distance in the attractive dimension and the
difference between a large constant c and the distance in the repulsive
dimension. In this formulation, a lower score indicates a higher rank.
Even though the individual contributions from the dimensions are
summed, since the distance computation is non-monotonic, the
formulation inherits the non-monotonicity property as well.
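
To make the non-monotonic behaviour of Eqn. 2 concrete, the following minimal Python sketch evaluates the score while a point moves along the attractive dimension. The function name `score_eqn2`, the sample coordinates, and the value c = 100 are illustrative assumptions and do not come from the paper.

```python
def score_eqn2(p, q, c=100.0):
    """Eqn. 2 (lower is better): x is attractive, y is repulsive.

    c is the 'large constant' of Eqn. 2; 100 is an arbitrary choice.
    """
    px, py = p
    qx, qy = q
    return (c - abs(py - qy)) + abs(px - qx)


q = (5.0, 5.0)
# Increase p_x from 3 to 7 while keeping p_y fixed at 9.
scores = [score_eqn2((px, 9.0), q) for px in (3.0, 5.0, 7.0)]
print(scores)  # the score first falls and then rises as p_x grows
```

Because the score falls and then rises as $p_x$ increases, no monotonic function of the raw coordinates can reproduce it, which is the point made in the paragraph above.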

The above discussion shows why a monotonic function is not 
suitable for the proposed problem of modeling attractive and repul- 
sive dimensions. Since computing the distance between a query 
and a data point on any dimension is non-monotonic, the ranking 
function that aggregates the individual distances is non-monotonic 
as well. Owing to this property, existing top-k techniques that
assume monotonic scoring functions are unable to solve the problem.

Among existing top-k techniques, only [15] and [19] have looked
at non-monotonic functions. Both techniques propose general-purpose
pruning strategies on hierarchical index structures.
Consequently, [15] and [19] are applicable to a wide range of functions.
In our solution, we take a more direct approach and develop pre- 
computation based index structures specifically for the proposed 
class of linear scoring functions. Furthermore, both [15] and [19] 
assume disk-resident index structures. Consequently, the techniques 
focus on minimizing disk accesses and guaranteeing I/O optimality. 
Our technique assumes a main-memory setting due to the impres- 
sive growth rate of main-memory capacities and the advent of solid 
state drives where random accesses are far cheaper than traditional 
hard disks. 

In this paper, we formulate a query, called SD-Query, that
aggregates similarity and distance into a single function. Two index
structures are developed to answer top-1 and top-k queries on
non-monotonic scoring functions. The top-1 index structure allows us
to take advantage of the fact that k is known a priori. For example,
if a data analysis program is required to return only the best candi- 
date, then the top-1 index structure fits the application better. The 
second index structure is developed for the generic setting where 
the value of k is supplied at runtime. The major contributions of 
the paper are summarized as follows: 

1. Query formulation: We formulate the problem of top-k query
over a mixture of attractive and repulsive dimensions and
demonstrate its usefulness in a number of applications.
2. Index Structures: We develop two novel index structures to
answer top-1 and top-k queries on a non-monotonic scoring
function. Theoretical bounds guarantee a linear growth rate for the
index structures. Empirical results demonstrate a better performance
by one to two orders of magnitude over existing techniques.

2. PROBLEM FORMULATION 

In this section, we formally define the problem and introduce the 
concepts that are central to the techniques developed in the paper. 

Definition 1. SD-Query. Assume we have a dataset P of
multidimensional points $p_i = [p_{i_1}, .., p_{i_m}]$. Given query
$q = [q_1, .., q_m]$, sets of dimensions $\mathbb{D}$ and $\mathbb{S}$
associated with sets of weight parameters $\alpha$ and $\beta$
respectively, and an integer k, the goal is to find the k
highest scoring points on the following function:

$SD\text{-}score(p, q) = \sum_{i \in \mathbb{D}} \alpha_i |p_i - q_i| - \sum_{j \in \mathbb{S}} \beta_j |p_j - q_j|$ (3)

As can be seen, the above function is repulsive towards dimensions
in $\mathbb{D}$ and attractive towards dimensions in $\mathbb{S}$.
$\alpha_i$ and $\beta_j$ act as weighting parameters to tune the
relative importance of the dimensions. The first summation in Eqn. 3
computes the weighted Manhattan distance over repulsive dimensions
that are desired to be as distant as possible, whereas the second
summation computes the total distance over attractive dimensions
chosen to be similar. As a result, SD-score increases with the
distances along repulsive dimensions and decreases with the
distances along attractive dimensions.
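
As a baseline for the indexing techniques that follow, Eqn. 3 can be evaluated directly and the top-k answer obtained with a linear scan. The sketch below is a minimal illustration under assumptions: the names `sd_score` and `top_k_bruteforce`, the dictionary encoding of the weighted dimension sets, and the sample points are hypothetical and not part of the paper.

```python
import heapq


def sd_score(p, q, repulsive, attractive):
    """Eqn. 3. `repulsive` and `attractive` map a dimension index to its
    weight (alpha_i and beta_j respectively)."""
    far = sum(w * abs(p[i] - q[i]) for i, w in repulsive.items())
    near = sum(w * abs(p[j] - q[j]) for j, w in attractive.items())
    return far - near


def top_k_bruteforce(points, q, k, repulsive, attractive):
    """Linear scan: score every point and keep the k highest."""
    return heapq.nlargest(
        k, points, key=lambda p: sd_score(p, q, repulsive, attractive)
    )


points = [(0, 0, 0), (1, 5, 1), (4, 4, 0)]
query = (0, 0, 0)
repulsive = {1: 1.0}             # dimension 1 should be far from the query
attractive = {0: 1.0, 2: 0.5}    # dimensions 0 and 2 should be close to it
print(top_k_bruteforce(points, query, 1, repulsive, attractive))
```

The scan costs time linear in the dataset size for every query, which is the cost the index structures in the later sections aim to avoid.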

To answer SD-Queries efficiently, the challenge is to develop 
pruning strategies that produce the answer set without scanning the 
entire dataset. We address that challenge by identifying the iso- 
lines of Eqn. 3 in the 2D plane and then designing a deterministic 
approach. 

Definition 2. Isoline. An isoline is defined as a line that 
connects points of equal value in the plane for a given query. 

For dimensions above two, we divide the query into subproblems 
of two dimensions, and then aggregate each of the subproblems 
to produce the answer set. A number of existing top-k querying
techniques [7, 8, 19] take a similar approach of solving subproblems
optimally and then aggregating them to compute the final answer
set. However, in these techniques, a subproblem consists of a single
dimension, whereas in our approach a subproblem comprises
two dimensions. As a result, better scalability with respect to
dimensionality is achieved. This result is verified in Section 6.

Hereon, we assume the points to be 2-dimensional and develop
pruning techniques for points in the (x, y) plane. We solve the
problem for arbitrary dimensions in Section 5 by extending the
techniques developed in Sections 3 and 4. Without loss of
generality, we consider dimension x for similarity and y for distance.
Therefore, for query $q = [x_q, y_q]$, we try to maximize the following
function:

$SD\text{-}score(p, q) = \alpha |y_p - y_q| - \beta |x_p - x_q| \quad \forall p \in P$ (4)

Example: If $\alpha = \beta = 1$, for the sample database in Figure 1,
SDscore {pi,qi) = 3—0 = 3 and SDscore{pi, q2) — 2 — — 2. 

Figure 2: (a) 45° projections of points on query q. (b)
Illustration of the four possible projections from a point. (c)
Illustration of the scenario when SD-score(p, q) < 0.

Consider the example shown in Figure 2a, which shows points
from a sample dataset and a given query point q. Assume both
dimensions are equally important, resulting in $\alpha = \beta = 1$. Given
such a scenario, the solid line across q represents its axis and the
dashed lines emerging from the database points are termed
projections. The projections are lines originating from the data points
$p_i$ at a 45° angle. Due to this geometry, they have a special property
with respect to q. For each point, the score with respect to q is the
same as the score between q and the intersection point of its
projection with q's axis, since these points lie on the same isoline. For
any arbitrary $\alpha$ and $\beta$, the angle can be computed as:



$\theta = \arctan \frac{\beta}{\alpha}$ (5)
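
The angle in Eqn. 5 can be recovered from Eqn. 4 with a short derivation. The steps below are a sketch for the quadrant $y_p > y_q$, $x_p > x_q$ (the other quadrants differ only in sign); the symbol $s$ for a fixed score value is introduced here for illustration.

```latex
\begin{align*}
s &= \alpha\,(y_p - y_q) - \beta\,(x_p - x_q)
  && \text{(Eqn. 4 with } y_p > y_q,\ x_p > x_q\text{)} \\
y_p &= y_q + \frac{s}{\alpha} + \frac{\beta}{\alpha}\,(x_p - x_q)
  && \text{(all points with the same score } s\text{)} \\
\tan\theta &= \frac{\beta}{\alpha}
  \;\Longrightarrow\; \theta = \arctan\frac{\beta}{\alpha},
  \qquad \alpha = \beta = 1 \;\Rightarrow\; \theta = 45^\circ
\end{align*}
```

The isoline is therefore a straight line whose slope depends only on the ratio of the two weights, not on the query or on the score value.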



Definition 3. Axis. The axis of a query point $q = [x_q, y_q]$
is defined by the line $x = x_q$.

Definition 4. Projection. A projection of a point p is a line
originating from p at an angle $\theta$ computed using Eqn. 5. In the 2D
plane, each point generates four projections in four directions. We
call these projections left lower projection (abbr. llp), right lower
projection (abbr. rlp), left upper projection (abbr. lup) and right
upper projection (abbr. rup). An example is shown in Figure 2b.
From geometric reasoning, projections of the same type (such as
llp) are parallel to each other.

Example: In Figure 2a, the three shown projections are of types
rlp ($p_1$), llp ($p_2$) and rup ($p_3$).

Hereon, for explanatory purposes and without loss of generality,
we assume $\alpha = \beta = 1$, resulting in $\theta = 45°$. All the
theory developed in this paper holds for any arbitrary $\alpha$ and $\beta$.

Note that in Figure 2a, only one of the four projections is shown for
each point. However, as can be seen in Figure 2b, four projections
exist for each point, and only one of them is the correct isoline with
respect to a query. Thus, given query q, it is important to
deterministically choose the correct projection for a point p. Towards
that goal, we make the following observations. First, if $x_q < x_p$,
we only need to decide between the left projections of p since the
right projections will never intersect q's axis. Second, if $y_p < y_q$,
then none of the lower projections will provide the correct score.
Analogous relationships exist for $x_q > x_p$ and $y_p > y_q$. Based on
these observations, we make the following claims:

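Claim 1. For query $q = [x_q, y_q]$, if projections from $p = [x_p, y_p]$
intersect q's axis at points $p''$ and $p'$, and q is located between $p''$
and $p'$, then SD-score(p, q) < 0. An example is shown in Figure 2c.

Proof: Due to geometric constraints,

$p'' = [x_q, y_p + |x_q - x_p|]$
$p' = [x_q, y_p - |x_q - x_p|]$

Thus, $y_q = y_p - |x_q - x_p| + c$ where $0 < c < 2|x_q - x_p|$. Therefore,

$SD\text{-}score(p, q) = |y_p - y_q| - |x_p - x_q|$
$= |y_p - y_p + |x_q - x_p| - c| - |x_p - x_q|$
$= \big| |x_q - x_p| - c \big| - |x_p - x_q| < 0$,

since $0 < c < 2|x_q - x_p|$ implies $\big| |x_q - x_p| - c \big| < |x_q - x_p|$. □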
Now, let us define a function Projection(p, q) for points
$p = [x_p, y_p]$ and query $q = [x_q, y_q]$.

$Projection(p, q) = \begin{cases}
lup, & \text{if } y_p < y_q \text{ and } x_p > x_q \\
llp, & \text{if } y_p > y_q \text{ and } x_p > x_q \\
rup, & \text{if } y_p < y_q \text{ and } x_p < x_q \\
rlp, & \text{if } y_p > y_q \text{ and } x_p < x_q
\end{cases}$ (6)



Hereon, whenever we refer to a projection of a point p on a query,
we assume it to be the projection chosen according to Eqn. 6.
Moreover, we refer to the intersection point between p's projection
and q's axis as p's projected point on q.

Let the intersection point between Projection(p, q) and q's axis
be $p' = [x_q, y_{p'}]$.
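
The choice of projection and the resulting projected point can be sketched in a few lines of Python for the case $\alpha = \beta = 1$. The function names `projection_type` and `projected_point` are hypothetical, and the handling of ties ($x_p = x_q$ or $y_p = y_q$), which Eqn. 6 does not cover, is an assumption of this sketch: ties are treated as right and lower projections.

```python
def projection_type(p, q):
    """Return 'lup', 'llp', 'rup' or 'rlp' following Eqn. 6.

    Assumption: ties fall into the right / lower branches.
    """
    (xp, yp), (xq, yq) = p, q
    side = "l" if xp > xq else "r"      # p right of the axis: left projection
    level = "up" if yp < yq else "lp"   # p below q: upper projection
    return side + level


def projected_point(p, q):
    """Intersection of the chosen 45-degree projection with the axis x = x_q."""
    (xp, yp), (xq, yq) = p, q
    dx = abs(xq - xp)
    if projection_type(p, q).endswith("up"):
        return (xq, yp + dx)            # upper projections rise towards the axis
    return (xq, yp - dx)                # lower projections fall towards the axis


print(projection_type((3, 1), (1, 4)), projected_point((3, 1), (1, 4)))
print(projection_type((0, 6), (2, 1)), projected_point((0, 6), (2, 1)))
```

For arbitrary weights, the vertical offset `dx` would be scaled by $\beta / \alpha$, in line with Eqn. 5.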

Claim 2. If p does not satisfy the conditions in Claim 1, then
SD-score(p, q) = SD-score(p', q).

Proof: There are two possible cases, $y_{p'} > y_q$ and $y_{p'} < y_q$.
For Case 1: $y_{p'} > y_q$ (7)

Since projections follow a 45° angle and p does not satisfy the
conditions in Claim 1,

$y_{p'} = y_p - |x_q - x_p|$ (8)

$SD\text{-}score(p', q) = |y_{p'} - y_q|$
$= \begin{cases}
|y_p - y_q| - |x_q - x_p|, & \text{if } y_p - y_q > |x_q - x_p| \\
|x_q - x_p| + y_q - y_p, & \text{if } y_p - y_q < |x_q - x_p|
\end{cases}$
$= SD\text{-}score(p, q)$

since $y_p - y_q < |x_q - x_p|$ contradicts Eqn. 7.
Case 2: The proof follows analogously. □

Claim 3. If p satisfies the conditions in Claim 1, then
SD-score(p, q) = $-|y_q - y_{p'}|$.

Proof: There are two possible cases, $y_p > y_q$ and $y_p < y_q$.
For Case 1: $y_p > y_q$.

Since a lower projection (rlp or llp) would be selected,

$y_{p'} = y_p - |x_q - x_p|$
$SD\text{-}score(p, q) = |y_q - y_p| - |x_q - x_p|$
$= y_p - y_q - |x_q - x_p|$
$= y_{p'} - y_q$
$= -|y_{p'} - y_q|$

where the last step holds because q lies above $p'$ under the
conditions of Claim 1.
Case 2: The proof follows analogously.^1 □

Claim 1 formalizes the conditions under which a point is guaranteed
to return a negative score. Claim 2 establishes the projections
as the isolines of all data points with positive scores. For the
remaining points with negative scores, although their projections are
not the isolines for a given query, Claim 3 shows that the scores can
still be computed using just the projections. Next, we show that to
compute the top-k answer set for any given query, the search can
be limited to only those points that correspond to either the top-k
highest lower projections or the top-k lowest upper projections on
the query's axis.
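
Claims 1 to 3 together imply that, for $\alpha = \beta = 1$, the score of a point equals the signed vertical offset between its projected point and the query. The following Python sketch checks this numerically on random integer points. The helper names, the random ranges, and the treatment of the tie $y_p = y_q$ as a lower projection are assumptions made for illustration.

```python
import random


def sd_score_2d(p, q):
    """Eqn. 4 with alpha = beta = 1."""
    return abs(p[1] - q[1]) - abs(p[0] - q[0])


def score_from_projection(p, q):
    """Score derived only from the projected point p' on q's axis."""
    (xp, yp), (xq, yq) = p, q
    dx = abs(xq - xp)
    if yp >= yq:                  # a lower projection is selected
        return (yp - dx) - yq     # y_{p'} - y_q, negative under Claim 1
    return yq - (yp + dx)         # upper projection: y_q - y_{p'}


def check_claims(trials=1000, seed=0):
    """Return True if both formulas agree on `trials` random pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = (rng.randint(-50, 50), rng.randint(-50, 50))
        q = (rng.randint(-50, 50), rng.randint(-50, 50))
        if sd_score_2d(p, q) != score_from_projection(p, q):
            return False
    return True


print(check_claims())
```

A randomized check of this kind does not replace the proofs, but it is a cheap way to catch sign errors when implementing the projections.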



^1 For arbitrary $\alpha$ and $\beta$, SD-score(p, q) = $-\alpha |y_{p'} - y_q|$.
Claim 4. For any query q, the top-k result is a subset of the
points corresponding to the highest lower projections and the points
corresponding to the lowest upper projections on q's axis.

Proof: With respect to q, any data point p can be assigned to one
of two groups: $y_p > y_q$ and $y_p < y_q$.

Group 1: For the group $y_p > y_q$, Projection(p, q) is always a lower
projection (i.e., rlp or llp). Now, based on whether p satisfies Claim
1, Group 1 can further be divided into two subgroups: $y_{p'} > y_q$
and $y_{p'} < y_q$, where $p'$ is p's projected point on q.
Subgroup 1: $y_{p'} > y_q$. From Claim 2, SD-score(p, q) = $y_{p'} - y_q$.
Subgroup 2: $y_{p'} < y_q$. From Claim 3, SD-score(p, q) = $y_{p'} - y_q$.
Thus, for both subgroups, SD-score(p, q) increases with $y_{p'}$. As a
result, the top-k answer set within this group is equivalent to the
points with the top-k highest lower projections on q.
Group 2: For the group $y_p < y_q$, Projection(p, q) is always an
upper projection. It can be analogously shown that SD-score(p, q)
decreases with $y_{p'}$, as a result of which the top-k answer set is
equivalent to the points with the lowest upper projections.
Thus, under all situations, the top-k answer set is a subset of the
points corresponding to the top-k highest lower projections or the
top-k lowest upper projections. □

Example: In Figure 2a, the highest lower projection and lowest
upper projection on q come from $p_1$ (rlp) and $p_3$ (rup) respectively.
Thus, if we are looking to compute the top-1, a comparison is
required only between points $p_1$ and $p_3$.

Claim 4 establishes that the search space for the top-k answer
set can be pruned drastically by analyzing the projections on the
given query's axis. However, projecting the database points on the
query's axis and then determining the top-k projections keeps the
cost linear. In the next two sections, we thus focus on how the
discovered properties can be used to compute the top-k answer set
in sublinear time.

3. INDEX STRUCTURE FOR TOP-1 

In this section, we develop an index structure under the assumption
that k and the weighting parameters $\alpha$ and $\beta$ are known a
priori. Although we make the assumption that $k = \alpha = \beta = 1$, the
techniques developed in this section are generalizable to arbitrary
values of k, $\alpha$, and $\beta$. The a priori knowledge of the parameters is
used to design a highly compact and efficient index structure. The
advantage of the a priori knowledge of the parameters in reducing
storage and computation costs is quantified in Section 6. We
generalize the index structure for the case where the parameters are
supplied at query time in the next section.

Given the knowledge that k = 1, using Claim 4, a point p is
a candidate for top-1 only if its projection on query q's axis is
either the lowest upper projection or the highest lower projection.
Although this result allows us to prune the majority of the data points,
the overhead of first projecting all data points on q's axis keeps
the cost linear. Thus, we ask the question: can the two extreme
projections on q be preserved and retrieved in a scalable manner?

To answer the question, we analyze the scenario shown in Figure
3. For simplicity, we consider just the lower projections. Consider
query $q_2$. The highest lower projection on its axis is from $p_1$.
However, for query $q_1$, the highest lower projection is from $p_2$. The
result shows that the ordering between the lower projections has
changed due to the shift in the location of the two queries. It can be
seen that the ordering of a projection can change only if it intersects
with another one. In this particular case, the intersection is between
the rlp of $p_2$ and llp of $p_1$. Further, by Claim 4 and due to geometric
constraints, the highest projection on any query q can change only
through an intersection between the rlp and llp of two points, while
the lowest projection can change due to an intersection between rup
and lup.

Figure 3: Illustration of how the ordering between projections
changes due to intersections. For simplicity, only the lower
projections are shown.

Example: Figure 3 provides a detailed illustration of how the
ordering between projections changes. As shown in the figure, $p_1$
provides the highest projection in the shaded region between $I_1$ and
$I_2$ for any given query. At $I_1$, the rlp of $p_2$ and llp of $p_1$
intersect, due to which the ordering changes. Thus, $p_2$ provides the highest
projection in the region left of $I_1$. Moreover, $p_1$ will never provide
the highest projection either to the left or right of the shaded region
due to the geometry of the projections.
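
The location of an intersection such as $I_1$ follows from the line equations of the two projections. The derivation below is a sketch for $\alpha = \beta = 1$; the coordinates $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$ are notation introduced here for illustration, with $p_2$ lying to the left of $p_1$.

```latex
\begin{align*}
\text{rlp of } p_2:&\quad y = y_2 - (x - x_2), \qquad x \ge x_2 \\
\text{llp of } p_1:&\quad y = y_1 - (x_1 - x), \qquad x \le x_1 \\
\text{intersection } I:&\quad
  x_I = \frac{(x_1 + x_2) + (y_2 - y_1)}{2}, \qquad
  y_I = \frac{(y_1 + y_2) + (x_2 - x_1)}{2}
\end{align*}
```

The two projections actually meet only when $x_2 \le x_I \le x_1$; otherwise one of the half-lines ends before reaching the crossing point, and the ordering of the two projections never changes.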

Based on the observations noted in the above example, we make 
the following claim: 

Claim 5. For any point p, there can exist at most one contin- 
uous region r, where p provides the highest lower projection. The 
statement also holds for lowest upper projections. 

Proof: Case 1: Let us first consider the case of highest lower
projections. Assume p provides the highest lower projection in
region r. The boundaries of r are defined by vertical lines running
through the points at which the llp or rlp of p is intersected by a
projection from some other point. Now, for p to provide the
highest lower projection again in a region outside r, one of its lower
projections must become the highest again. However, we show that
this is never possible.

Subcase 1.1: Right lower projection of p becomes the highest
projection again. Assume the llp of some other point p' intersects the
rlp of p. To the right of the intersection point, the rlp of p can never
again be the highest lower projection since:

1. throughout its extent, the llp of p' remains above the rlp of p.

2. when the llp of p' ends at p', the rlp of p' starts and remains above
the rlp of p throughout.

The proofs of Subcase 1.2 for the llp of p and Case 2 for lowest upper
projections follow analogously. □

From Claim 5, it follows that in a dataset of n points, there can 
be at most n unique regions corresponding to points providing the 
highest lower projections and n more for lowest upper projections. 
The intersection of these regions results in at most 2n regions of 
unique pairs of highest and lowest projection providers. This result 
establishes that for any dataset, the 2D space can be divided into 
regions, such that in each region the two points corresponding to the 
highest and lowest projections remain static. Now, given any query 
q, the highest and lowest projections on its axis can be retrieved 
by finding the region in which q's axis lies. Once the region is
found, computing the answer takes constant time since it involves 
comparing the points corresponding to the two extreme projections. 

Motivated by the above observation, we focus on developing an
algorithm to efficiently index regions with unique pairs of highest
and lowest projection providers. First, we scan the space left
to right to identify points that correspond to a highest lower
projection, or a lowest upper projection within some region r. When
such a point is found, the point and the boundary of region r are
stored in the index structure. The resultant index structure can be
viewed as a sorted array of regions. The first cell in the array
corresponds to the leftmost region, whereas the last cell corresponds to
the rightmost region. The boundary of a region is identified by the
x-intercept of the vertical line that passes through the intersection
of the projections between two consecutive regions. For higher
values of k, instead of tracking the highest and lowest projections, we
need to track the k highest and k lowest projections. Further, any
region where the ordering of the k highest or k lowest projections
changes needs to be indexed.

Algorithm 1 ConstructTop-1Index(P)

Require: P is a set of points

1: P_llp ← points sorted in descending order of their llp on x = -∞
2: P_lup ← points sorted in ascending order of their lup on x = -∞
3: TopLlp ← P_llp[1]
4: TopLup ← P_lup[1]
5: upperIndex ← empty array
6: lowerIndex ← empty array
7: for i in range(2, |P|) do
8:   next ← P_llp[i]
9:   if llp of next intersects rlp of TopLlp then
10:    in ← intersection between rlp of TopLlp and llp of next
11:    add(upperIndex, ⟨TopLlp, x_in⟩)
12:    TopLlp ← next
13:  next ← P_lup[i]
14:  if lup of next intersects rup of TopLup then
15:    in ← intersection between rup of TopLup and lup of next
16:    add(lowerIndex, ⟨TopLup, x_in⟩)
17:    TopLup ← next
18: add(upperIndex, ⟨TopLlp, ∞⟩)
19: add(lowerIndex, ⟨TopLup, ∞⟩)
20: return merge(upperIndex, lowerIndex)

Alg. 1 presents the pseudocode for the index construction
algorithm. First, two sorted lists are constructed to facilitate the
line-sweep algorithm. The first list arranges points in descending order
of the llps when projected on the line x = -∞ (line 1). As a
result, the first point in the list corresponds to the highest llp incident
on x = -∞. Similarly, the second list is ordered based on lups
in ascending order (line 2). From each of the ordered lists, the top
elements are fetched (lines 3-4). Both top elements are guaranteed
to be in the index, since they correspond to either the highest or
lowest projection. In each of the next iterations, the next points in
the lists are fetched. For the upper index, the next point is added
if its llp intersects with the rlp of the current top element. Along with
the point, the x-value of the intersection point is added. The vertical
line through the intersection point defines the region boundary.
Similar checks are made among upper projections to add points to
the lower index (lines 7-17). Once the iteration completes, the upper
and lower index structures are merged. The merging procedure
is the same as merging two sorted arrays. Finally, the merged index
structure is returned (line 20).

Example: Consider Figure 3 as our sample database. For
simplicity, we consider only the highest lower projections. Alg.
1 first creates a sorted list based on the llps of all points on the
line x = -∞. The first point in the list is $p_2$ since it provides the
highest projection. To determine the extent of the region where $p_2$
provides the highest lower projection, the second point in the list,
$p_1$, is fetched to check if its llp intersects the rlp of $p_2$. Since the
check is true for $p_1$, a cell is created for $p_2$ which stores the point
itself and the x-value of the intersection point $I_1$. The next point
fetched from the sorted list is $p_4$. However, $p_4$ is discarded since
its llp does not intersect the rlp of $p_1$. Continuing in the same
manner, since $p_3$'s llp intersects the rlp of $p_1$, a second cell
containing point $p_1$ and the x-value of the intersection point $I_2$ is
added. Finally, a cell is also created for $p_3$ with an x-value of ∞
since its rlp is not intersected by any of the database points. Thus,
the algorithm produces an array of three cells corresponding to
points $p_2$, $p_1$, and $p_3$. Both $p_4$ and $p_5$ get discarded since
they never provide the highest lower projection.

As can be seen, the index construction algorithm keeps the
regions sorted based on the x-intercepts of the boundaries. Now,
given any query $q = [x_q, y_q]$, the region in which q's axis lies can be
found by performing a binary search for $x_q$ on the index. Once the
region is found, a comparison between the two points corresponding to
the highest and lowest projections in that region produces the answer.
The above algorithm indexes only those points that have a chance 
of being in the top-1 answer set for any given query. Below, we 
discuss the storage and computation complexities of the proposed 
approach for a database with n points. 

1. Storage Cost: There can be at most 2n regions. For each region,
the two points corresponding to the highest and lowest projections,
and the x-intercept of the boundary are stored. This results in O(n)
storage cost.

2. Querying Cost: For each query, a binary search is performed
with the x-intercept of its axis. The binary search costs O(log n)
time. Once the region is found, computation of top-1 requires
constant time. Thus, the total cost is O(log n).

3. Index Construction Time: To construct the index structure, the
projections of each point need to be sorted. Once sorted, a
modified version of the line-sweep algorithm is performed to find the
regions in linear time. Thus, the total cost is O(n log n).

When generalized to top-k, the storage cost, querying cost, and
index construction time are O(kn), O(log n + k), and O(n log n +
nk) respectively.

Next, we discuss how updates can be made on an existing index. 

1. Insert: To insert a point p, a binary search is performed to
identify the region where p lies. If p provides neither the highest lower
projection nor the lowest upper projection in that region, p does not
need to be inserted. Otherwise, the left projections of p are
compared iteratively to the indexed points with x-values less than that
of p and the right projections are compared to the indexed points
with higher x-values. The iteration continues till a left projection of
p is intersected by the right projection of an indexed point and
similarly the right projection of p is intersected by the left projection of
an indexed point. The intersections define the region within which
p provides the highest or lowest projections. The computation cost
of the operation is bounded by O(n).

2. Delete: Since the top-1 index only stores points that have a
chance of being in the answer set, no changes are necessary for
deleting a data point p that is not in the index. Otherwise, the
index is rebuilt only within the region where p provides the highest
or lowest projection using the same method as in Alg. 1. Note that
we do not need to recompute or sort the projections of the data
points since they were already computed while constructing the
index. Thus, deleting a point incurs O(n) cost.

4. INDEX STRUCTURE FOR TOP-K 

In this section, we develop an index structure for top-k queries.
A straightforward extension of the structure developed for top-1
does not work since k and the weighting parameters are supplied at
runtime. Specifically, there are two main bottlenecks. First, if we
are to extend the same technique, for each region, we need to store
the entire ordering of projections, thus requiring $O(n^2)$ storage for
the entire index structure. Second, since the weighting parameters
are supplied only at query time, a static ordering among the
projections cannot be assumed. The challenge is therefore to keep the
storage requirement linear and answer top-k queries at logarithmic
computation cost in spite of the added complexities. We tackle this
challenge by dividing the problem into two subproblems: first, we
develop an index structure assuming fixed angles of projections.
Next, we adapt the index structure for the general case where the
weighting parameters are provided at query time.

Figure 4: A sample database of points. $p_i$ represent database
points and $q_i$ represent query points.

4.1 Fixed angle of projection 

Given a fixed angle of projection, we observe that due to geometric
constraints, projections of the same type are parallel to each
other. For example, the rlp of all points are parallel to each other.
As a result, projections of the same type never intersect among
themselves, and thus maintain a fixed ordering. Thus, projections
can be presorted into four lists based on their type. Now, given
query q, if we are able to identify the top-k highest rlp, llp, and
top-k lowest rup, lup on q's axis, then the search for the top-k points
can be limited to the 4k points corresponding to the identified
projections (Claim 4). However, retrieving the top-k projections on q
is not an easy problem since projections cover only part of the 2D
space. An example is shown in Figure 4 to illustrate the point. For
simplicity, we only analyze the highest lower projections on q, and
thus limit ourselves to the rlp and llp of the points. For query q1,
five projections intersect its axis: llp from p1, p3, p4, and p5 and
rlp from p2. On the other hand, for q2, the llp from p2 intersects its
axis while the rlp does not. It can be seen that for an llp or lup to
intersect the query's axis, the corresponding point must be located
to the right of the query. Analogously, points corresponding to rup
and rlp must be located to the left of the query. Thus, to solve the
problem of finding the top-k projections on a query, we need to
perform a range search to retrieve the projections that intersect the
axis. However, instead of returning all projections in the range, the
search should only return the top-k projections.

Towards that goal, we develop a tree that facilitates both range
search and top-k queries. First, a tree is built on the x-values of
points to facilitate range search. The tree is constructed in a manner 
similar to KD-tree [6], but on a single dimension with a branching 
factor b. At each non-leaf node, the dataset is divided in a balanced 
manner into b subsets and the process is repeated recursively. Each 
subset contains points with x-values less than or equal to a cho- 
sen separating plane. For example, if the branching factor is 2, 
the median of the points is chosen as the separating plane. The x- 
intercepts of the separating planes are stored at each non-leaf node. 
All points are stored at leaf nodes. 

Once the tree is constructed, for each point in the dataset, its llp
and lup are projected on the line x = -∞, while the rup and rlp
are projected on the line x = ∞. Projecting on the two extreme
lines allows us to order all projections in the dataset. Next, at each
non-leaf node, we store the bounds on the highest and lowest
projections in its subtree. Specifically, the y-value of the intersection
point between the highest rlp, llp, and the lowest rup, lup with the
relevant line, which is either x = -∞ or x = ∞, is calculated and
stored at the non-leaf nodes. With this information, the tree inherits
a heap-like property.

Figure 5: Index structure on points in Figure 4 for top-k query.
Each non-leaf node contains the x-intercept of the separating
plane and the tuple representing the bounds on rlp and llp in
its subtree. The tuples shown on top of the nodes represent
the bounds after the update operation for query q1. The leaves
contain the coordinates of the points.

Example: Figure 5 presents the tree built with a branching factor
of 2 on points shown in Figure 4. For simplicity, the tree tracks
only rlp and llp. The y-values of the intersection points of the
projections with the lines x = -∞ and x = ∞ are also shown in
Figure 5. Without loss of generality, the y-values are chosen
arbitrarily, but maintain the same ordering among projections. The leaves
are represented as dashed nodes and contain the points. Each of
the non-leaf nodes contains the x-intercept of the separating plane,
and a tuple representing the highest rlp and llp in its subtree.

The constructed tree allows fast answering of queries. The
pseudocode of the algorithm is provided in Alg. 2. Given query q, a
range search is performed to group the points into two sets based
on which side of q's axis they lie on. Specifically, at each node,
the left child is chosen if x_q is less than or equal to the x-intercept
of the separating plane; otherwise, the right child is chosen. For
higher branching factors, the traversal is generalized in the same
manner. The search stops at a leaf node and the path from the root
to this node acts as the dividing plane for the points. We call this
the "separating path".

Example: A range search using q1 (shown in Figure 4) on the
index in Figure 5 ends at p4, and the path from the root to p4 is
the separating path. The llp and lup from all nodes in and right of
the separating path intersect q1's axis. Similarly, rlp and rup from
nodes left of the separating path intersect as well.
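
The range-search tree and the separating path can be illustrated with a short Python sketch. It is a minimal sketch rather than the implementation evaluated in this paper: the dictionary-based node layout and the names build and separating_path are assumptions made for the example, x-values are assumed to be distinct, and the projection bounds are left out.

```python
import bisect

def build(xs, b):
    """Build a balanced tree with branching factor b over sorted x-values."""
    if len(xs) == 1:
        return {"leaf": xs[0]}
    parts = min(b, len(xs))                 # never create an empty subset
    size, extra = divmod(len(xs), parts)
    children, seps, start = [], [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        children.append(build(xs[start:end], b))
        seps.append(xs[end - 1])            # largest x in the subset acts as the separating plane
        start = end
    # the last subset needs no separator: it takes everything beyond the others
    return {"seps": seps[:-1], "children": children}

def separating_path(tree, x_q):
    """Follow the child whose separator is the first one >= x_q, down to a leaf."""
    path, node = [], tree
    while "leaf" not in node:
        pos = bisect.bisect_left(node["seps"], x_q)
        path.append(pos)
        node = node["children"][pos]
    return path, node["leaf"]

tree = build([1, 2, 3, 4, 5], b=2)
print(separating_path(tree, 3.5))           # child positions along the path, then the leaf reached
```

With a branching factor of 2, the binary search over the separators reduces to the left/right rule stated above; for larger b the same call selects among all children of a node.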



Algorithm 2 top-k query(Index, q, k)

Require: q is a query point
1: answer ← empty array of size k
2: cand ← empty array of size 4
3: r ← root of index
4: updateBounds(r, q)
5: insert getTop(q, llp), getTop(q, lup), getTop(q, rlp), getTop(q, rup) in cand
6: while size(answer) < k do
7:     (p, type) ← highest scorer in cand and corresponding projection type
8:     add p to answer
9:     insert getTop(q, type) to cand
10: return answer

Algorithm 3 updateBounds(n, q)

Ensure: finds separating path and updates bounds on the path
1: if n is leaf then
2:     return
3: llpBound ← max(llpBound(child)) | ∀child ∈ n, x_child ≥ x_q
4: lupBound ← min(lupBound(child)) | ∀child ∈ n, x_child ≥ x_q
5: rlpBound ← max(rlpBound(child)) | ∀child ∈ n, x_child < x_q
6: rupBound ← min(rupBound(child)) | ∀child ∈ n, x_child < x_q
7: pos ← binarysearch(array of x-intercepts of children, x_q)
8: updateBounds(child[pos], q)



Now, the goal is to get the top-k of each of the intersecting
projections. For that task, an update operation is performed on the
separating path. At each node along the separating path, bounds
on the highest llp and lowest lup are updated to the highest and
lowest among the children located in or right of the path. Analogously,
bounds on the right projections are updated among nodes left of
the separating path (Alg. 3). The updated values for q1 are shown
on top of each node in the separating path in Figure 5. With this
update operation, the root contains the highest and lowest values
only among projections that are incident on the query's axis. For
example, even though the highest llp is from p2, due to the update
operation, the llp bound at root changes to 4 corresponding to p1.
Next, the search for the top-k on a particular projection type proceeds
one element at a time: each search starts at the root and traverses to
the child which has the same value for that type. This traversal is
performed recursively at each node till a leaf-node with the same value
is reached. From the index construction algorithm, it is guaranteed
that at any node, one of its children will have the same value. Thus,
each search step deterministically takes us closer to the answer.

Example: Assume we are searching for the highest llp on q1 in
Figure 5. The llp value at root is 4, and its left child has the same
value as well. Thus, the first step traverses to the left child of the
root. In the next step, the correct answer p1 is selected and
returned.

Once the answer is returned, the bound in the parent node is
recalculated by ignoring the current answer. Further, the change in
bound is propagated upwards along the traversed path. The change
in the bounds facilitates the search of the next element in top-k.
Once the update operation is complete, the cycle for top-1 ends.

Example: In the update operation for q1, the parent of p1 is
updated to (2, 3) since the next highest llp in the subtree that
intersects q1's axis is p4. The change is then propagated to the root
which gets updated to (2, 3) as well.

For the top-k search, once the separating path is found and the
bounds are updated, a top-1 search is made on each of the four
projection types (Alg. 2, line 5). Among the four points returned,
the highest scorer is added to the answer set and the remaining
three are retained as candidates. For subsequent searches,
the search is performed only on the projection type that contributed
the most recent point in the answer set (lines 6-9). For example,
if in the first iteration, the highest scorer corresponds to projection
type llp, then in the next iteration the search would be performed
only on llp. The algorithm terminates after k + 3 searches.
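
The update and search steps can be sketched in Python for a single projection type. The sketch is illustrative only and rests on several assumptions: the branching factor is 2, only llp is handled (so only points with x-values at or right of the query qualify), the y-intercepts of the projections on x = -∞ are passed in as precomputed keys, and the names Node, build, bound_of, update_bounds, get_top, and top_k_llp are hypothetical. Query-time changes to the bounds are kept in a per-query dictionary so that the stored index is left untouched.

```python
import bisect

NEG_INF = float("-inf")

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi      # covers sorted positions [lo, hi)
        self.left = self.right = None
        self.bound = NEG_INF           # highest llp key in this subtree

def build(keys, lo, hi):
    """Balanced binary tree over positions lo..hi-1 of the x-sorted points."""
    node = Node(lo, hi)
    if hi - lo == 1:
        node.bound = keys[lo]
        return node
    mid = (lo + hi) // 2
    node.left, node.right = build(keys, lo, mid), build(keys, mid, hi)
    node.bound = max(node.left.bound, node.right.bound)
    return node

def bound_of(node, live):
    """Query-time bound if one was set, otherwise the stored bound."""
    return live.get(id(node), node.bound)

def update_bounds(node, pos, live):
    """Restrict bounds to positions >= pos; only the separating path is descended."""
    if node.hi <= pos:                 # entirely left of the query: excluded
        live[id(node)] = NEG_INF
    elif node.lo < pos:                # straddles the query: on the separating path
        update_bounds(node.left, pos, live)
        update_bounds(node.right, pos, live)
        live[id(node)] = max(bound_of(node.left, live), bound_of(node.right, live))
    # otherwise the subtree is entirely right of the query and its bound stands

def get_top(root, live):
    """Return the position of the highest remaining key, or None if exhausted."""
    if bound_of(root, live) == NEG_INF:
        return None
    path, node = [], root
    while node.left is not None:
        path.append(node)
        same = bound_of(node.left, live) == bound_of(node, live)
        node = node.left if same else node.right
    live[id(node)] = NEG_INF           # ignore the returned point from now on
    for parent in reversed(path):      # propagate the change up to the root
        live[id(parent)] = max(bound_of(parent.left, live), bound_of(parent.right, live))
    return node.lo

def top_k_llp(xs, root, x_q, k):
    pos = bisect.bisect_left(xs, x_q)  # first point with x >= x_q
    live = {}
    update_bounds(root, pos, live)
    found = []
    while len(found) < k:
        i = get_top(root, live)
        if i is None:
            break
        found.append(i)
    return found

xs = [1, 2, 3, 4, 5]                   # x-values, sorted
keys = [3, 9, 4, 2, 1]                 # made-up llp intercepts on x = -inf
root = build(keys, 0, len(keys))
print(top_k_llp(xs, root, 2.5, 2))     # prints [2, 3]
```

In this toy input the globally highest key (9) belongs to a point left of the query, so the update step lowers the bound at the root to 4, mirroring the behavior described for Figure 5. The other three projection types would follow the same pattern with min in place of max for upper projections and with the side of the separating path reversed for right projections.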

Below, we discuss the storage and computation complexities of 
the index constructed on n data points and a branching factor b. 

1. Storage Cost: Since the tree is constructed in a balanced manner,
the height is log_b n. The number of nodes per level follows a geometric
progression, resulting in O((bn - 1)/(b - 1)) storage.

2. Querying Cost: For any query, first the separating path is
found in b log_b n time. After the update, k + 3 top-1 searches are
performed, where each search consumes 2b log_b n time. Further,
whenever a point is returned, four comparisons are made for
insertion into the answer set. Thus, the total querying time is bounded
by O(kb log_b n + k).

3. Index Construction Time: To construct the index structure,
first the points are sorted based on their x-values. The n log n cost
of sorting dominates the other index construction operations:
projecting points to x = -∞ and x = ∞ (O(n) time), and adding
bound information to each non-leaf node (O(n log_b n)). Thus, the
overall time complexity is O(n log n).
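
For reference, the node count behind the storage cost can be written out as a geometric sum. The derivation below is a sketch that assumes n is an exact power of b, so that every level of the tree is full.

```latex
\begin{align*}
\text{nodes} &= \sum_{i=0}^{\log_b n} b^{i}
  = \frac{b^{\log_b n + 1} - 1}{b - 1}
  = \frac{bn - 1}{b - 1}, \\
\text{non-leaf nodes} &= \frac{bn - 1}{b - 1} - n = \frac{n - 1}{b - 1}.
\end{align*}
```

For example, with n = 8 and b = 2 the tree has 15 nodes, of which 7 are non-leaf nodes that carry projection bounds.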

Next, we briefly discuss how the top-k index can be adapted for a
disk-resident version. Since the shape of the proposed top-k index
structure is highly similar to a B+-tree, similar index construction
techniques can be employed. Each node can be stored in a single
disk page. Further, instead of a uniform branching factor, each of
the non-leaf nodes should contain between c and 2c children, where
c is chosen according to the disk page size. Similar to B+-trees,
for efficient construction of the index structure, bulk-loading can
be employed. More specifically, the data can first be sorted based
on their x-values, and the index can then be built in a bottom-up
manner. For each level of the index, the nodes can be packed
entirely full, except for the rightmost node. Once the tree is built, the
bounds on the projections can be added. During query time, the
majority of the search proceeds in the same manner as in the in-memory
version. However, since each of the leaf nodes contains multiple
data points in the disk-resident version, a comparison among those
points is required to identify the one with the highest score.

Next, we discuss how updates can be performed efficiently.

1. Insert: The insert operation for a new point p starts at the root
node and recursively descends to the appropriate child based on the
x-value. At the end of the path, two cases are possible. (i) p collides
with another leaf node l. In this case, a new non-leaf node replaces
l. The new non-leaf node contains both l and p. (ii) p does not
collide with another leaf node. In this case, p is simply inserted as
a new leaf node. After p is inserted, its projections are computed,
and the bounds on the path to the root are updated. The insert
operation takes O(log_b n) time.

2. Delete: A delete operation deletes the leaf-node corresponding
to the chosen point and updates the projection bounds on the path
from the leaf to the root. The operation takes O(b log_b n) time.

While updates can be made in an efficient manner, each update
can make the tree unbalanced. More specifically, the height of the
tree can exceed log_b n, resulting in slower querying times. We
tackle this problem by keeping track of the set of leaf-nodes, U, on
paths exceeding the length of log_b n. An unbalanced index would
affect the querying time if the answer set contains points from U.
Thus, when the probability of longer querying times exceeds
a given threshold, we rebuild the index structure.

4.2 Answering queries with arbitrary weighting parameters

Although the top-k index structure developed is efficient in
answering top-k queries, its applicability is significantly hampered
due to the assumption of a fixed angle of projection. Thus, to
remove this bottleneck, we extend the index structure to handle
queries with arbitrary weighting parameters on the repulsive and
attractive dimensions. As discussed in Section 2, the angle of
projection can be computed using Eqn. 5.

Algorithm 4 top-kArbitraryParameters(Index, q, k, α, β)
1: θ_q ← angle of projection for α and β (arctan expression of Eqn. 5)
2: Find indexed angles θ_l and θ_u where θ_l < θ_q < θ_u
3: top-k_{θ_l} ← top-k answer set on q at θ_l
4: top-k'_{θ_u} ← ∅
5: top-k_{θ_q} ← empty priority queue of maximum size k
6: while top-k_{θ_l} ⊄ top-k'_{θ_u} do
7:     add p, the next top-1 point on q at θ_u, to top-k'_{θ_u}
8:     evaluate score of p at θ_q and add (p, Score(p, θ_q)) to top-k_{θ_q}
9: return top-k_{θ_q}

To make the index structure more flexible, we observe the
following properties of the projections.

1. The angle of any projection in a quadrant lies in the range [0°, 90°].

2. Let Score(p, θ) represent the score of a point p when projected at
an angle θ on a query q's axis. For two points p1 and p2 and angles
θ1, θ2, θ3 such that θ1 < θ2 < θ3, if Score(p1, θ1) > Score(p2, θ1)
and Score(p2, θ2) > Score(p1, θ2), then Score(p2, θ3) > Score(p1, θ3).
In other words, once the relative order of two points flips as the
angle grows, it does not flip back. This result follows from the
geometry of the projections.

Based on the above properties we claim the following.

Claim 6. Let us denote the top-k answer set for q at any angle
θ as top-k_θ. Assume we have two index structures for projections
at angles θ_l and θ_u, and the angle of projection for a query q is θ_q,
where θ_l < θ_q < θ_u. Then,

\[
\text{top-}k_{\theta_q} \subseteq \operatorname*{arg\,min}_{k'} \{\, \text{top-}k'_{\theta_u} \ \text{s.t.}\ \text{top-}k_{\theta_l} \subseteq \text{top-}k'_{\theta_u} \,\}
\tag{9}
\]

Proof by contradiction: Assume top-k_{θ_q} ⊄ top-k'_{θ_u}.
Therefore, there is at least one point p, where p ∈ top-k_{θ_q} and
p ∉ top-k'_{θ_u}, and therefore, p ∉ top-k_{θ_l}. Thus, there is some point
p' ∈ top-k_{θ_l}, where Score(p', θ_l) > Score(p, θ_l) and Score(p, θ_q) >
Score(p', θ_q). Therefore, based on observation 2 in Section 4.2,
Score(p, θ_u) > Score(p', θ_u), which is a contradiction since, by
assumption, p' ∈ top-k'_{θ_u} and p ∉ top-k'_{θ_u}. □

Claim 6 provides a framework to compute the top-k answer set
on a non-indexed angle of projection if the angle is bounded
between two angles [θ_l, θ_u] that are already indexed. First, a top-k
query needs to be performed on θ_l. Next, the smallest enclosing
top-k'_{θ_u} needs to be computed such that top-k_{θ_l} ⊆ top-k'_{θ_u}. The
points in top-k'_{θ_u} can now be projected based on θ_q to compute the
answer set (Alg. 4). Note that to index multiple angles, it is not
necessary to build as many index structures. Since the information
required to perform a range search is independent of the projection
angles, a single index structure containing the bounds on llp,
lup, rlp, and rup for each of the indexed angles is enough. More
specifically, at each non-leaf node, we store a hashmap. The keys
of the hashmap consist of the indexed angles, and each key points
to a bucket containing the bounds on the projections for that angle.
With the addition of information on multiple angles, the storage
cost of the index structure is O(n + m(n - 1)/(b - 1)) where m is the
number of indexed angles.

Alg. 4 outlines the procedure to compute the top-k answer set
on a non-indexed angle. Given an index structure, first, the two
consecutive indexed angles are found between which the query
angle lies (lines 1-2). Next, the top-k is computed on the smaller of
the two indexed angles, θ_l (line 3). Once top-k_{θ_l} is computed,
top-k'_{θ_u} is populated by computing top-1 repeatedly on angle θ_u till
all elements in top-k_{θ_l} are fetched. The score at θ_q is computed for
each of the points fetched in the while loop and maintained in a
priority queue (lines 4-8). Finally, the answer set is returned (line 9).
Note that the separating path is computed only once while computing
top-k_{θ_l} since it is independent of the projection angles.
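
The bracketing idea of Claim 6 can be checked on a toy example with the Python sketch below. It is a sketch under stated assumptions, not the indexed implementation: each point is given directly as its pair of distances (dx on a repulsive dimension, dy on an attractive dimension) from the query, the score at angle θ is assumed to be cos(θ)·dx − sin(θ)·dy (which ranks points the same way as any positive rescaling of the two weights and satisfies observation 2), and a full sort stands in for the repeated top-1 searches on an indexed angle. The names score, ranked, and top_k_between are hypothetical.

```python
import math
import random

def score(p, theta_deg):
    """Assumed score: weight cos(theta) on the repulsive distance, sin(theta) on the attractive one."""
    dx, dy = p
    t = math.radians(theta_deg)
    return math.cos(t) * dx - math.sin(t) * dy

def ranked(points, theta_deg):
    """Stand-in for an indexed angle: point ids in order of decreasing score."""
    return sorted(range(len(points)), key=lambda i: -score(points[i], theta_deg))

def top_k_between(points, k, theta_q, theta_l, theta_u):
    """Answer a top-k query at theta_q using only the rankings at theta_l and theta_u."""
    need = set(ranked(points, theta_l)[:k])   # top-k at the lower indexed angle
    fetched = []                              # grows into top-k' at the upper indexed angle
    for i in ranked(points, theta_u):         # repeated top-1 at theta_u
        if not need:
            break                             # every point of top-k at theta_l has been fetched
        fetched.append(i)
        need.discard(i)
    fetched.sort(key=lambda i: -score(points[i], theta_q))
    return fetched[:k], len(fetched)

rng = random.Random(7)
points = [(rng.random(), rng.random()) for _ in range(200)]
answer, k_prime = top_k_between(points, k=5, theta_q=30, theta_l=23, theta_u=45)
print(answer == ranked(points, 30)[:5], k_prime)
```

The second returned value is k', the number of points that had to be fetched at the upper angle; the closer the indexed angles are to the query angle, the closer k' tends to stay to k.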

Choosing angles to index: Since a non-indexed angle needs to
be bounded within two indexed angles, to cover the entire range of
query angles, 0° and 90° are the two recommended angles. The
choice of additional angles to index can be guided by two
factors: domain knowledge/history of previous queries, and the
main-memory budget. Based on the memory budget, the number of
angles to index can first be computed. Next, if the distribution of
query projection angles is available, then angles can be indexed
based on samples drawn (without replacement) from that
distribution. Otherwise, a uniform distribution of angles between 0° and
90° can be chosen. As shown later in Section 6, an index structure
on five angles chosen uniformly is highly efficient in answering the
proposed queries. Furthermore, choosing angles from the uniform
distribution represents the worst case scenario. Availability of more
information can only improve the performance.

5. EXTENSION TO HIGHER DIMENSIONS 

In this section, we generalize the top-k algorithm to higher
dimensions. For both top-1 and top-k queries, the core of the
developed algorithms is based on projections. In the 2D space,
projections take the shape of a line, and thus the index structures
concentrate on indexing line segments. However, in dimensions higher
than two, projections take the form of hyperplanes, making the
problem much harder. Thus, to keep the problem tractable for higher
dimensions, we first divide the problem into 2D and 1D subproblems
and then solve them individually. Next, we aggregate the
subproblems to produce the final answer set.

Eqn. 3 defines the SD-score for any number of dimensions. Recall
that D contains the dimensions which are desired to be distant from
the query (repulsive), whereas S represents the dimensions desired
to be similar (attractive). Next, we define subsets M ⊆ D and
N ⊆ S where |M| = |N| = min(|D|, |S|). Further, a bijective
function f : M → N is defined that allows us to map each repulsive
dimension in M to an attractive dimension in N. Based on this
pairing, the problem is divided into 2D and 1D subproblems. More
specifically, Eqn. 3 is reexpressed as the following.

\[
\text{SD-score}(p,q) =
\Bigg( \sum_{i \in M,\, j = f(i)} \big( \alpha_i |q_i - p_i| - \beta_j |q_j - p_j| \big) \Bigg)
+ \Bigg( \sum_{i \in (D \setminus M)} \alpha_i |q_i - p_i| \Bigg)
- \Bigg( \sum_{j \in (S \setminus N)} \beta_j |q_j - p_j| \Bigg)
\tag{10}
\]

As evident from Eqn. 10, each summation corresponds to a
subproblem. The first summation in Eqn. 10 involves two dimensions
and is solved using the top-k index structure developed in Section
4. The other two subproblems, corresponding to the second and
third summations, involve dimensions that need to be solved
individually. The proposed formulation maximizes the number of 2D
subproblems and consequently, minimizes the total number of
subproblems. Eqn. 10 can be completely reduced to 2D subproblems
if the cardinalities of D and S are the same. Otherwise, there will
always be some dimension that cannot be paired up to take advantage
of the top-k index structure. As the value of ||D| - |S|| increases,
SD-score reduces to the familiar top-k similar or distant queries. A
high difference in the cardinalities allows less opportunity to
employ the developed index structure for better pruning of the search
space.

Example: Consider the database of advertisement publishers
with attributes D = {Price}, and S = {HitRate, Coverage}
shown in Figure 6. A possible strategy would be to divide the
problem into two subproblems. The first is a 2D subproblem by
pairing 'Price' with 'Hit Rate', and the second is a 1D subproblem on
'Coverage'.
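
The pairing step can be sketched in a few lines of Python. This is only an illustration: the function name split_subproblems is hypothetical, and pairing dimensions in list order is one arbitrary choice of the bijection f, since the text does not prescribe which repulsive dimension is matched with which attractive one.

```python
def split_subproblems(repulsive, attractive):
    """Pair min(|D|, |S|) dimensions into 2D subproblems; the rest stay 1D."""
    m = min(len(repulsive), len(attractive))
    pairs = list(zip(repulsive[:m], attractive[:m]))   # one possible bijection f: M -> N
    return pairs, repulsive[m:], attractive[m:]

pairs, left_d, left_s = split_subproblems(["Price"], ["HitRate", "Coverage"])
print(pairs, left_d, left_s)
# The total number of subproblems is max(|D|, |S|), here 2.
print(len(pairs) + len(left_d) + len(left_s))
```

At most one of the two leftover lists is non-empty, which is why the second and third summations of Eqn. 10 never appear together.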

We now discuss how each of the subproblems can be solved and
then aggregated to compute the final answer. As already
mentioned, the 2D subproblems can be solved using the index
structure developed in Section 4. We therefore concentrate on how the
1D subproblems can be solved efficiently. Solving on a single
dimension d is performed using a bidirectional search. As part of
pre-processing, each dimension is maintained in a sorted container.
For a given query q, two pointers are maintained to track candidates
for top-k. If dimension d is desired to be distant, then the pointers
are initialized to data points corresponding to the first and last
elements in the dimension. On the other hand, if d is desired to be
similar, then a binary search is performed with q_d on the selected
dimension, where q_d represents the value of q in dimension d. If
position pos is returned from the binary search, then the pointers
are initialized to points corresponding to the elements in position
pos and pos - 1 in dimension d. In essence, after initialization, the
pointers point to those two elements that have a chance of being
the top-1 point for the subproblem involving dimension d. During
query-time on dimension d, the pointers of d are used to fetch the
two candidates and the better candidate is returned as the answer.
Further, the pointer corresponding to the answer is updated to the
immediate "unexplored" neighbor.

Example: To solve the 1D problem on the database shown in
Figure 6 for the dimension on 'Coverage', first, it is reordered in the
shown sorted container. For the given query, a binary search is
performed with its 'Coverage' value of 75. As a result of this search,
the two pointers are assigned to <B, 80> and <C, 68>. Next,
to determine the most similar data point on 'Coverage', a
comparison is performed only between B and C. Once B is returned, the
pointer to B is updated to D.
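
The bidirectional search can be sketched with two Python generators, one for an attractive dimension and one for a repulsive dimension. This is a minimal sketch: the names nearest_first and farthest_first are hypothetical, and since Figure 6 is not reproduced here, the 'Coverage' values used for A and D are made up for illustration (only B = 80, C = 68, and the query value 75 come from the example above).

```python
import bisect

def nearest_first(sorted_items, q_value):
    """Attractive dimension: yield points in order of increasing distance from q_value."""
    values = [v for v, _ in sorted_items]
    hi = bisect.bisect_left(values, q_value)   # pointer to the first value >= q_value
    lo = hi - 1                                # pointer to the last value < q_value
    while lo >= 0 or hi < len(values):
        take_lo = hi >= len(values) or (
            lo >= 0 and q_value - values[lo] <= values[hi] - q_value)
        if take_lo:
            yield sorted_items[lo][1]
            lo -= 1                            # move to the unexplored left neighbor
        else:
            yield sorted_items[hi][1]
            hi += 1                            # move to the unexplored right neighbor

def farthest_first(sorted_items, q_value):
    """Repulsive dimension: yield points in order of decreasing distance from q_value."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        if abs(sorted_items[lo][0] - q_value) >= abs(sorted_items[hi][0] - q_value):
            yield sorted_items[lo][1]
            lo += 1
        else:
            yield sorted_items[hi][1]
            hi -= 1

# (value, point) pairs sorted by value; A and D are illustrative values only
coverage = [(25, "A"), (68, "C"), (80, "B"), (95, "D")]
print(list(nearest_first(coverage, 75)))       # prints ['B', 'C', 'D', 'A']
print(list(farthest_first(coverage, 75)))      # prints ['A', 'D', 'C', 'B']
```

Each generator hands out one point per request, so a caller can pull only as many candidates as the thresholding step requires.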

The top-k answer set is computed in an iterative manner. At each
iteration, the top point is fetched for each of the subproblems. For 
each of those points, its score against the query is calculated and 
added to a priority queue of size k. Further, a threshold is calcu- 
lated which plugs the scores from each of the individual subprob- 
lems into Eqn. 10 based on the type. The threshold provides an 
upper bound on the score of any point that has not been explored 
yet. Thus, the algorithm stops iterating if the kth element in the pri- 
ority queue has a score higher than or equal to the threshold. The 
stopping criterion is the same as in Threshold Algorithm [8] and is 
guaranteed to return an optimal solution. 

Example: Expanding on the previous two examples, to
compute the top-k answer set on the publishers' database, in the first
iteration, the highest scoring points for each of the subproblems,
a 2D subproblem on 'Price' and 'Hit Rate' and a 1D problem on
dimension 'Coverage', are fetched. Assume the weighting
parameters are all equal to 1. Thus, data points A and B are returned
as the answers for the 2D and 1D subproblems with scores of 90
and 5 respectively. Therefore, the threshold is 90 - 5 = 85 since
the 2D subproblem corresponds to the first summation in Eqn. 10
and the 1D problem corresponds to the third summation. Further,
the scores for A and B, 40 and 45 respectively, are calculated (on
the entire query rather than just the subproblems) and added to the
priority queue. Similarly, in the next iteration, the second highest
scoring points for each of the two subproblems are fetched (C for
both subproblems), the threshold is updated to 68, and the points
are inserted into the priority queue. The iteration stops when the
kth element in the priority queue has a score higher than or equal
to the threshold. If k = 1, then the computation would stop at the
second iteration since the SD-score of C is the same as the threshold.
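
The iteration in this example can be replayed with a short Python sketch of the thresholding loop. It is a sketch under assumptions rather than the implementation evaluated here: the function name threshold_top_k is hypothetical, each subproblem is represented by a precomputed list of (point, signed sub-score) pairs in decreasing order instead of a live 2D index or bidirectional search, and the sub-scores are the ones implied by the numbers above (the 2D score of 75 for C follows from its threshold of 68 and distance of 7), except for point D, whose values are made up.

```python
import heapq

def threshold_top_k(streams, full_score, k):
    """streams: one list per subproblem of (point, signed sub-score), best first."""
    seen = {}
    best = []                                  # min-heap holding the k best (score, point)
    depth = 0
    max_depth = max(len(s) for s in streams)
    while depth < max_depth:
        threshold = 0.0
        for stream in streams:
            point, sub_score = stream[min(depth, len(stream) - 1)]
            threshold += sub_score             # upper bound on any unexplored point
            if point not in seen:
                seen[point] = full_score(point)
                heapq.heappush(best, (seen[point], point))
                if len(best) > k:
                    heapq.heappop(best)
        depth += 1
        if len(best) == k and best[0][0] >= threshold:
            break                              # kth best score has reached the threshold
    return sorted(best, reverse=True), depth

two_d = {"A": 90, "C": 75, "B": 50, "D": 20}   # 2D sub-scores on Price / Hit Rate
cov = {"B": 5, "C": 7, "D": 20, "A": 50}       # distance on Coverage (attractive, subtracted)
streams = [
    sorted(two_d.items(), key=lambda kv: -kv[1]),
    sorted(((p, -d) for p, d in cov.items()), key=lambda kv: -kv[1]),
]
result, iterations = threshold_top_k(streams, lambda p: two_d[p] - cov[p], k=1)
print(result, iterations)                      # prints [(68, 'C')] 2
```

The thresholds computed in the two iterations are 85 and 68, matching the example, and the loop stops after the second iteration for k = 1.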

Although the thresholding technique is similar to TA, the key
difference lies in the granularity of the subproblems. In TA, and many
other top-k techniques [7, 8, 19], each subproblem corresponds to
a single dimension. However, in our approach, each subproblem is
composed of two dimensions. As a result, a high performance gain
is achieved. The impact of the granularity of the subproblems is
most prominently evident in scalability against dimension. It is
important to note that scalability against dimension for top-k queries
is a particularly hard problem. In [18], the authors show that some
of the best-performing techniques [3, 10, 18] suffer from scalability
bottlenecks above dimensions of size 3. Thus, achieving close to
optimal performance on two-dimensional datasets provides a
significant boost in raising the dimensionality bar for top-k queries.
We quantify the details of the performance gains in Section 6.

Figure 6: A sample database and query involving 3 dimensions.

6. EXPERIMENTS 

In this section, we report the experimental results that validate 
the efficiency of the proposed techniques and highlight the applica- 
bility on a real dataset. 

6.1 Experimental Setup 

For a thorough evaluation, we use synthetic datasets of up to
ten million points generated from uniform, correlated and
anti-correlated distributions, and a real dataset of chemical molecules.
The index structure for top-k is built on 5 angles distributed
uniformly across 90°: 0°, 23°, 45°, 67°, 90°. We choose the α and β
parameters from a uniform distribution between 0 and 1. Unless
specifically mentioned, k is set to 5. All experiments are performed
on 100 randomly selected points from a uniform distribution.

For benchmarking, we choose sequential scan, and main-memory 
based adapted versions of TA [8], Branch-and-Bound Processing 
of Ranked Queries (BRS) [15], and Progressive Exploration (PE) 
[19]. Note that both BRS and PE were originally designed for disk- 
resident index structures. We use them for the benchmarking stud- 
ies since no other main-memory based technique exists that is able 
to handle non-monotonic functions. 

To adapt TA for the proposed class of functions, an ordered list 
of the data points is maintained for each dimension. Given a query, 
a binary search is performed to fetch the farthest point on each of 
the repulsive dimensions and the closest points on the attractive di- 
mensions. The pruning threshold is computed based on the points 
fetched. To adapt BRS, an in-memory R*-tree is constructed. The 
node capacity is selected by optimizing the querying performance 
on a database of 10000 points drawn from a uniform distribution. 
Based on this optimization, the selected node capacities are 28, 
16, 12, and 9 for dimensions 2, 4, 6 and 8 respectively. Given a 
query, the space is divided into regions such that in each region, the
scoring function is either monotonically decreasing or increasing
with each dimension. Using the algorithm outlined for constrained
top-k queries in BRS, top-k queries are performed simultaneously
on each region and the scores are maintained in a single priority
queue to compute the answer set. All algorithms are implemented
in Java using SUN JDK 1.6.0. The experiments are performed on a
3.2GHz, 4GB memory PC running Debian Linux 4.0.

6.2 Quantitative Analysis 

First, we benchmark the performance of the proposed index
structure in the multi-dimensional case. Figures 7a-7c demonstrate the
growth rate of querying time against dataset size for 6-dimensional
points. Three dimensions each are chosen for distance and
similarity. Clearly, SD-Index top-k displays the lowest growth rate
across all three distributions. Although both TA and SD-Index
top-k are based on aggregation of subproblems to compute the
final answer set, each subproblem for SD-Index comprises two
dimensions, whereas TA treats each dimension as a subproblem.
Due to this basic difference in granularity, SD-Index achieves a
better performance. While BRS performs better than TA on the
uniformly distributed dataset, TA achieves a superior performance on
the correlated and anti-correlated datasets. The bounds computed
from the minimum bounding rectangles (MBR) in the uniformly
distributed dataset are tighter than the bounds in the correlated and
anti-correlated datasets. As a result, BRS achieves a higher pruning
on the uniform data. Compared to sequential scan, SD-Index performs
better by an order of magnitude. The performance of PE is similar to
sequential scan on 6-dimensional datasets.

Figure 7: Growth rate of querying time against dataset size (a-c), number of dimensions (d-f), k (g-h), and number of attractive
dimensions (i-j) on uniform, correlated and anti-correlated distributions for SD-Index top-k, BRS [15], TA [8], and PE [19].

In Figures 7d-7f, we analyze the scalability of SD-Index top-k
against dimension size. Due to the significantly weaker
performance of PE compared to the other methods, we exclude the
technique from the remaining benchmarking studies on querying costs.
In the top-k setting, well-known techniques have been shown to
suffer at dimensions above 3 [18]. Further, none of the recent top-k
techniques [15, 17, 19, 20] have analyzed datasets with dimensions
above 8. However, even under this context, SD-Index achieves a
superior performance at all dimension sizes. As can be seen, both
TA and BRS start performing worse than sequential scan at
dimensions above 6. As observed in most index structures, sequential
scan performs best in terms of scalability. BRS, which is based on
hierarchical index structures, suffers at higher dimensions due to
the inherent issue of the curse of dimensionality [2]. The bounds
derived from the MBRs for BRS get progressively looser with higher
dimensions. On the other hand, for both SD-Index and TA, more
iterations need to be performed for a given k at higher dimensions.
However, since each subproblem in SD-Index comprises two
dimensions, the number of iterations is significantly less than TA.
This results in superior scalability against dimension when
compared to TA, and in general, against the approach of considering
each dimension as a subproblem for solving top-k queries [7, 8, 19].



Figures 7g-7h analyze the growth rate of the querying time against
k on a 6-dimensional dataset as k is varied from 5 to 100. We
exclude the plot for the anti-correlated dataset since the growth rate is
similar to the correlated dataset. As can be seen, SD-Index top-k
outperforms sequential scan, BRS, and TA for all values of k.

Figures 7i-7j analyze the performance of SD-Index top-k when
the number of attractive dimensions is varied. We vary the number
of attractive dimensions from 0 to 3 to investigate all possible
pairing scenarios. As can be seen, SD-Index top-k achieves a superior
performance in all scenarios except when the number of attractive
(or repulsive) dimensions is 0. In this setting, SD-Index top-k
degenerates into the adapted version of TA and as a result, the
performance suffers. It is important to note, however, that the basic
assumption in the proposed problem is the existence of at least one
attractive and one repulsive dimension. Otherwise, the problem
translates to a simple distance or similarity query.

Figure 8a analyzes how updates affect the performance of the
SD-Index top-k index structure on points drawn from uniform and
correlated distributions. We do not analyze the SD-Index top-1
index since its performance is invariant to updates. To analyze the
growth rate of querying time with index updates, we deleted and
inserted an equal number of randomly selected points to maintain
the same index size. Next, we investigated the effect of the updates
on the querying cost. In the plot, SD-Index denotes the querying time
without updates, and SD-Index* the querying time after updates. Note
that an x-axis value of 1000 represents 1000 deletes and 1000 inserts,
resulting in a total of 2000 updates. As can be seen, the growth rate
of the querying time is minimal. Figure 8b demonstrates the growth
rate of insertion cost with database size on the four shown index
structures. As can be seen, SD-Index top-1 achieves the lowest
insertion cost. Although SD-Index top-1 has a worst-case insertion
cost of O(n), in the average case it is much faster. Recall that
SD-Index top-1 only stores points that can be in the answer set. Since
the majority of the points do not satisfy this criterion, only an
O(log m) cost is incurred, where m is the number of indexed points.
Furthermore, even if the point being inserted can possibly be in the answer
set, an O(n) cost is required only when m = n and the projections of
the point being inserted do not intersect with the projections of any
of the indexed points. Deletions show a similar behavior.

Figure 8: (a) Growth rate of querying cost with updates. (b) Insertion cost vs. dataset size. (c-g) Experiments on 2D data: querying
time vs. dataset size for SD-Index top-k (c-d) and top-1 (e); querying time vs. k (f-g). (h-i) Growth rate of memory footprint against
dataset size (h) and branching factor (i). (j) Index construction time vs. dataset size.

The above results highlight the superiority of SD-Index over TA, 
BRS, PE and sequential scan in the multi-dimensional case. Now, 
we verify the performance on 2-dimensional points to demonstrate 
the performance gain achieved on each of the subproblems. 

Figures 8c-8d show the performance of SD-Index top-k against
dataset size on 2-dimensional points. We omit the anti-correlated
distribution since the result is similar to the performance on the
correlated distribution. As can be seen, SD-Index top-k outperforms
TA by more than two orders of magnitude and BRS by more than one
order of magnitude. Furthermore, given the significant performance
gap between BRS and SD-Index in the 2D case, SD-Index is still
likely to achieve a better performance on high-dimensional datasets
even if BRS is adapted in a manner similar to the proposed strategy
of 2D subproblems. Figure 8e repeats the same experiment on the
top-1 index structure. Similar to the result in the top-k setting,
a performance gain of more than two orders of magnitude over
sequential scan is observed. Among the data distributions, a better
performance is achieved on the correlated and anti-correlated
distributions due to their smaller index sizes. Overall, both the
top-1 and top-k index structures display a sub-linear growth rate,
which is consistent with the theoretical bounds.

Figures 8f-8g show the growth rate of querying time against k in
a 2-dimensional dataset of 10 million points. As can be seen,
SD-Index top-k outperforms both BRS and TA by a significant margin.
This result also highlights the speed-up that is achieved when
a subproblem is composed of two dimensions. Due to this crucial
difference in the computation costs of the subproblems, SD-Index
achieves a better performance when scaled to higher dimensions or
larger k.

Next, we analyze the memory footprint and construction times
of the index structures. Figure 8h depicts the growth rate of the
memory footprint on a 6-dimensional dataset. Since the index size
of SD-Index top-1 is dependent on the data distribution, we
investigate the growth rate across all three distributions. As
expected, the memory footprint of SD-Index top-1 is much smaller than
that of the top-k version. This result showcases the advantage of
SD-Index top-1 over top-k when k is known a priori. For top-1,
datasets drawn from correlated and anti-correlated distributions
require less memory since points located at the corners dominate
points located between them. As a result, a large number of points
can be discarded from the index structure. Figure 8i analyzes the
storage cost against the branching factor for SD-Index top-k.
Figure 8j demonstrates the growth rate of the index construction time
against dataset size. As expected, SD-Index top-1 has the fastest
index construction time. BRS achieves a marginally faster index
construction time than SD-Index top-k.

6.3 Qualitative Analysis 

In this section, we examine the quality of the top-k results on
a real dataset of chemical molecules and demonstrate its application
in data analysis. For this purpose, we choose the ChEMBL
dataset that contains 428,913 bioactive "drug-like" molecules along
with calculated properties such as drug-likeness score, logP value,
molecular weight (MW), etc. Drug-likeness is a concept used in
drug design to estimate how "drug-like" a prospective compound is.
A good drug should show good availability, low toxicity and high
potency. The well-established Lipinski's rule-of-five [11] specifies
four filters to screen drug-like molecules. One of these filters
states that for a molecule to be drug-like, its MW has
to be below 500. For this experiment, we search for exceptions to
this rule and check whether those molecules follow any pattern that
makes them drug-like in spite of being overweight.

To find exceptions to Lipinski's rule, we queried on a molecule
with a high drug-likeness score of 11 and a low MW of 250 for
similarity on drug-likeness and distance on MW. As a reference point,
the highest drug-likeness score and lowest MW in the dataset are
14.22 and 12.01 respectively. The rationale behind the query is to
check if overweight drug-like molecules exist. Table 1 summarizes
the result. The first row presents the overall average on the selected
features, whereas the next four rows present the average for the
corresponding values of k. Along with the drug-likeness score and
MW, Table 1 also presents the polar surface area (abbrv. PSA).

Table 1: Statistics on top-k results

Description       Drug-likeness score   MW       PSA
Overall Average   8.94                  422.6    112.14
k=10              9.87                  938.67   27.73
k=50              9.47                  897.5    42.17
k=100             9.18                  877.79   42.23
k=200             9.14                  824.24   47.46
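The query described above can be illustrated with a brute-force scan. The following is a minimal sketch, not the SD-Index algorithm: it assumes a weighted L1 form of the scoring function (distance on repulsive dimensions minus distance on attractive dimensions), and the function names, the weights, and the four molecules are hypothetical placeholders rather than ChEMBL records.

```python
import heapq

def sd_score(obj, query, attractive, repulsive):
    """Assumed score: weighted distance on repulsive dimensions minus
    weighted distance on attractive dimensions (higher is better)."""
    reward = sum(w * abs(obj[d] - query[d]) for d, w in repulsive.items())
    penalty = sum(w * abs(obj[d] - query[d]) for d, w in attractive.items())
    return reward - penalty

def top_k_scan(objects, query, attractive, repulsive, k):
    """Brute-force top-k: score every object and keep the k highest."""
    return heapq.nlargest(
        k, objects, key=lambda o: sd_score(o, query, attractive, repulsive)
    )

# Hypothetical molecules for illustration only.
molecules = [
    {"name": "m1", "drug_likeness": 10.5, "mw": 900.0},
    {"name": "m2", "drug_likeness": 10.8, "mw": 260.0},
    {"name": "m3", "drug_likeness": 3.0, "mw": 950.0},
    {"name": "m4", "drug_likeness": 9.9, "mw": 700.0},
]

# Query object from the text: drug-likeness 11, MW 250.
query = {"drug_likeness": 11.0, "mw": 250.0}

# Illustrative weights (an assumption) that put the two scales on a
# comparable footing: similarity on drug-likeness, distance on MW.
result = top_k_scan(
    molecules,
    query,
    attractive={"drug_likeness": 100.0},
    repulsive={"mw": 1.0},
    k=2,
)
print([m["name"] for m in result])
```

With these placeholder weights, the scan favors molecules that are far from the query in MW while staying close to it in drug-likeness, which is the behavior the query is meant to capture.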

Table 1 indicates that molecules with high MW can also be drug-like.
For all four values of k, the average drug-likeness score is
higher than the overall average, even though the average MW in the
top-k sets is roughly twice the overall average or more. A more
interesting pattern, however, is observed on the PSAs of the molecules.
PSA is defined as the sum of the surfaces of polar atoms, usually
oxygens, nitrogens and attached hydrogens, in a molecule. As
shown in Table 1, PSAs of the top-k molecules are much smaller
than the global average. Interestingly, a low PSA has been shown to
correlate with human intestinal absorption and blood-brain barrier
penetration, making it a good indicator of drug-likeness [16].
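The comparison against the overall averages can be made explicit with a small calculation over the values in Table 1. The snippet below is only a worked-arithmetic sketch; the variable names are hypothetical.

```python
# Averages copied from Table 1.
overall = {"score": 8.94, "mw": 422.6, "psa": 112.14}
top_k_averages = {
    10: {"score": 9.87, "mw": 938.67, "psa": 27.73},
    50: {"score": 9.47, "mw": 897.5, "psa": 42.17},
    100: {"score": 9.18, "mw": 877.79, "psa": 42.23},
    200: {"score": 9.14, "mw": 824.24, "psa": 47.46},
}

# Ratio of each top-k average to the overall average, per feature.
ratios = {
    k: {feature: row[feature] / overall[feature] for feature in overall}
    for k, row in top_k_averages.items()
}

for k, r in ratios.items():
    print(
        f"k={k}: score x{r['score']:.2f}, "
        f"MW x{r['mw']:.2f}, PSA x{r['psa']:.2f}"
    )
```

The MW ratios range from about 1.95 (k=200) to about 2.22 (k=10), while every PSA ratio is below one half.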

The result indicates one of the scenarios where a molecule should
not be discarded because of high MW, thereby highlighting an
application of SD-query. Traditional distance or similarity based
top-k queries fail to identify these molecules due to the limitation
of the scoring function. SD-query, on the other hand, takes a
query-centric approach that is capable of producing a more interesting
answer set, thus allowing a deeper analysis of a given dataset.

7. RELATED WORK 

Top-k queries have been an active research area in the database
community. Fagin [7] first introduced the rank aggregation problem
for multimedia databases. Since then, an array of techniques
has been developed to answer top-k queries [3, 4, 8-10, 12, 18, 20].
They can be grouped into three broad categories.
In the first category, the methods sort each dimension and compute
the answer set by making parallel access across dimensions. In the
layer-based category, data points are divided into layers to derive
an ordering. Points in a layer of higher precedence get inspected
before points in a layer of lower precedence. In the view-based
category, materialized views are used to answer top-k queries.
Unfortunately, none of the above techniques is capable of handling
non-monotonic functions.
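The first category can be illustrated with a threshold-style scan over per-dimension sorted lists. This is a minimal sketch in the spirit of TA, assuming the sum of coordinates as the monotonic scoring function; it is not the adapted TA used in the experiments, and the function name and sample points are hypothetical.

```python
import heapq

def threshold_algorithm(points, k):
    """Top-k by sum of coordinates using parallel sorted access.

    Returns a list of (score, point_id) pairs, best first.
    """
    dims = len(points[0])
    # One list per dimension: point ids sorted by that coordinate, descending.
    sorted_lists = [
        sorted(range(len(points)), key=lambda i, d=d: points[i][d], reverse=True)
        for d in range(dims)
    ]
    seen = set()
    top = []  # min-heap of (score, point_id) holding the best k seen so far
    for depth in range(len(points)):
        last_values = []
        for d in range(dims):
            pid = sorted_lists[d][depth]       # sorted access
            last_values.append(points[pid][d])
            if pid in seen:
                continue
            seen.add(pid)
            score = sum(points[pid])           # random access to all coordinates
            if len(top) < k:
                heapq.heappush(top, (score, pid))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, pid))
        # No unseen point can score higher than the sum of the values
        # last read from each list, because the score is monotonic.
        threshold = sum(last_values)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

points = [(5, 1), (4, 4), (1, 5), (3, 3), (2, 2)]
result = threshold_algorithm(points, 2)
print(result)
```

The early-termination test relies on monotonicity: once the k-th best score reaches the threshold, deeper entries in the sorted lists cannot improve the answer. For a non-monotonic function, the threshold is no longer a valid upper bound, which is why such techniques do not apply directly.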

Among existing techniques, only PE [19] and BRS [15] consider
non-monotonic functions. Both [19] and [15] assume datasets to
be indexed by disk-resident hierarchical indices and then develop
strategies for scoring functions without the assumption of
monotonicity. BRS assumes data points to be indexed by a hierarchical
index structure such as an R-tree, and then computes bounds on the
score attainable within each MBR for the given scoring function to
determine whether to explore its child nodes further. PE, on the other
hand, assumes each attribute to be indexed by a hierarchical index
structure. Given a query function, it proceeds by exploring the joint
space of the indices and computes bounds to decide whether to explore
further. PE defines a special class of semi-monotone functions and
develops effective pruning strategies for that class.
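The MBR-based pruning idea can be sketched as follows. This is an illustrative example under stated assumptions, not the BRS implementation: it assumes an unweighted L1 form of the scoring function (distance on repulsive dimensions minus distance on attractive dimensions), and all function names are hypothetical.

```python
def interval_min_dist(lo, hi, q):
    """Smallest |x - q| over x in [lo, hi]."""
    if q < lo:
        return lo - q
    if q > hi:
        return q - hi
    return 0.0

def interval_max_dist(lo, hi, q):
    """Largest |x - q| over x in [lo, hi]; attained at an endpoint."""
    return max(abs(lo - q), abs(hi - q))

def score(point, query, attractive, repulsive):
    """Assumed score: repulsive distance minus attractive distance."""
    reward = sum(abs(point[d] - query[d]) for d in repulsive)
    penalty = sum(abs(point[d] - query[d]) for d in attractive)
    return reward - penalty

def mbr_upper_bound(mbr, query, attractive, repulsive):
    """Upper bound on the assumed score of any point inside the box.

    mbr is a list of (lo, hi) intervals, one per dimension.
    """
    best_reward = sum(interval_max_dist(*mbr[d], query[d]) for d in repulsive)
    least_penalty = sum(interval_min_dist(*mbr[d], query[d]) for d in attractive)
    return best_reward - least_penalty

# Dimension 0 is attractive, dimension 1 is repulsive.
mbr = [(2.0, 4.0), (1.0, 5.0)]
query = (3.0, 2.0)
bound = mbr_upper_bound(mbr, query, attractive=[0], repulsive=[1])
print(bound)
```

A node whose bound does not exceed the current k-th best score can be skipped without visiting its children. Because the bound is a sum of per-dimension extremes, each added dimension contributes additional slack.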

The key property of both BRS and PE is that the same index
structures can be employed to answer a wide range of scoring
functions, including the proposed class of linear scoring functions.
Certainly, BRS and PE are able to handle a wider range of functions
than SD-Index. However, for the proposed class of scoring functions,
SD-Index achieves a superior performance due to the
precomputation-based strategy of indexing the isolines. Furthermore,
both BRS and PE are optimized for disk-based index structures,
whereas we focus on the memory-resident case.

8. CONCLUSION 

In this paper, we formulated a novel top-k query on a scoring
function that combines the idea of repulsive and attractive
dimensions. We developed two index structures for answering top-1 and
top-k queries in a scalable manner. Our unique strategy of indexing
the isolines of the scoring function achieved a performance gain of
one to two orders of magnitude over existing techniques. The
application of the proposed class of scoring functions is highlighted
in the qualitative analysis of a molecular dataset to identify
"drug-like" molecules that deviate from an established rule. For future
work, we plan to investigate two primary issues. First, for higher
dimensions, the isolines take the form of hyperplanes, which are
much harder to index. Thus, we want to study whether an index
structure can be developed to index hyperplanes directly. Second,
while scaling to higher dimensions, the mapping between attractive
and repulsive dimensions is currently performed in an arbitrary
manner. We plan to analyze whether a more focused strategy can
be developed in defining the mapping function, since a better
mapping would lead to more effective pruning of the search space.